A short guide on where to safely eat delicious food in Chicago

Table Of Contents

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as ss
import matplotlib.pyplot as plt
from matplotlib import rcParams
# figure size in inches
rcParams['figure.figsize'] = 8,5
import seaborn as sns
import os, sys
import re
from datetime import datetime, date, time
import json
import warnings
warnings.filterwarnings("ignore")
from shapely.geometry import shape, Point
from area import area
from pathlib import Path
from difflib import get_close_matches

# sklearn imports
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Interactive plots
from folium_plots import *
import plotly.graph_objects as go
import branca
import folium
from folium import plugins
from folium.plugins import HeatMap
from folium.plugins import TimestampedGeoJson

# load the functions for the plots and allow easy reloading
import folium_plots
import sys
if sys.version_info[0] == 3:
    from importlib import reload
In [2]:
# longitude / latitude of chicago
chicago_coords = [41.8781, -87.6298]
# geojson boundaries of the community areas
area_geo_path = r'data/geojson/boundaries.geojson'
geo_json_data = json.load(open(area_geo_path))
# geojson city boundary
city_geo_path = r'data/geojson/city.geojson'
geo_json_chicago_area = json.load(open(city_geo_path))

A. Data collection

1. Chicago Food Inspections (CFI) dataset

This is our main dataset, where the information is derived from inspections of restaurants and other food establishments in Chicago from January 1, 2010 to the present. Inspections are performed by staff from the Chicago Department of Public Health’s Food Protection Program using a standardized procedure. This dataset contains 22 columns and as of October 23, 2019 there are 194’784 entries (updated frequently, so subject to change).

In [3]:
cfi = pd.read_csv('data/food-inspections.csv')
print('There are {0} samples and {1} columns in the dataset'.format(cfi.shape[0],cfi.shape[1]))
There are 194784 samples and 22 columns in the dataset

2. Yelp Reviews (YR) dataset

Yelp.com is a website that helps people find great local businesses such as restaurants based on customer reviews. Each business has a page where visitors can write comments and add a general grade reflecting their experience there. The Yelp fusion API enables us to access relevant information on restaurants such as their rating and their respective number of reviews. When trying to scrap the data, we first encountered an issue of how to uniquely identify each restaurant. Indeed, the Yelp API enables us to pass as an argument the latitude and longitude of all places we want to query. We initially thought that this will enable us to find the right businesses. However, the Yelp coordinates do not match exactly the ones of our dataset. Thus, quering by latitude/longitude didn't give us the right business (we generally obtained restaurants, which were closed to the one in which we were interested).

Therefore, we had to come up with another way to query our dataset. By looking at the API documentation on Yelp Fusion website, we found another function, called business_match_query, which enables us to give as argument the name and complete address of every restaurant. Trying this function on a few cases showed us that by doing this "cross-research", taking into account several pieces of information on the businesses, we obtain the right results. However, one disadvantage of this function is that the business_match_query doesn't enable us to access directly the Yelp reviews. But it gives us the Yelp ID, which is a unique identifier on Yelp website for a business. As a next step, we can use the requests package to access the reviews by giving it as an argument the Yelp ID.

This way, we were able to obtain the right results. One disadvantage of this method is the fact that we need to query the dataset twice. First to obtain the Yelp ID via the business_match_query function, and secondly to obtain the reviews using the ID. For this reason, we haven't queried our whole dataset yet. We should be able to have the full dataset ready before Wednesday 27th November (as Yelp restricts us to 5'000 queries per day).

For the code of the scraping, please refer to the file called API_yelp. In this file, we simply implemented the procedure explained above. Every day, we scrapped 2'500 rows of our dataset, then saved it as a csv file, and re-open this file the next day to continue the scraping and so on (of course, we made some copies of it during the procedure to make sure to not lose any information).

3. Crimes dataset

This dataset reflects reported incidents of crime that occurred in the City of Chicago from 2001 to present, and is updated daily with a seven-day lag. For our project, we will however only focus on the data from January 1, 2010 onwards to match with the CFI dataset. Data is extracted from the Chicago Police Department's CLEAR system and due to privacy concerns, addresses are only shown at the block level. This level of resolution is however more than sufficient for the analyses we aim to undertake. This dataset contains 30 columns and as of October 23, 2019 there are 6'992’400 entries (updated frequently, so subject to change).

In [4]:
crimes = pd.read_csv('data/crimes.csv', low_memory=False)
print('There are {0} samples and {1} columns in the dataset'.format(crimes.shape[0],crimes.shape[1]))
There are 6992400 samples and 30 columns in the dataset

B. Data preprocessing

1. Chicago Food Inspections (CFI) dataset

The set is in a simple csv file, which makes it easy to read using Pandas. In this dataset, we have information on the location of the restaurants, where 7 columns contain spatial indicators, excluding the columns on the community area and zip codes, which are full of NaN values and will hence be disregarded. Information on the inspections consist of 5 columns including the inspection type, date, id, results and possible violations. The remaining columns are related to the restaurant itself, including the facility type, licence number, DBA, AKA names and risk.

In [5]:
# Getting a first feel for the dataset
with pd.option_context('display.max_columns', None): display(cfi.head(2))
Inspection ID DBA Name AKA Name License # Facility Type Risk Address City State Zip Inspection Date Inspection Type Results Violations Latitude Longitude Location Historical Wards 2003-2015 Zip Codes Community Areas Census Tracts Wards
0 2320260 GOPUFF GOPUFF 2684560.0 Grocery Store Risk 3 (Low) 1801 W WARNER AVE CHICAGO IL 60613.0 2019-10-22T00:00:00.000 License Fail NaN 41.956846 -87.674395 {'longitude': '41.956845683288854', 'latitude'... NaN NaN NaN NaN NaN
1 2320248 THE REDHEAD PIANO BAR THE REDHEAD PIANO BAR 2313942.0 Restaurant Risk 3 (Low) 16-18 W ONTARIO ST CHICAGO IL 60654.0 2019-10-22T00:00:00.000 License Pass w/ Conditions NaN 41.893371 -87.628783 {'longitude': '41.893370903547904', 'latitude'... NaN NaN NaN NaN NaN

1.1 General column exploration

Some columns only contain NaNs values, we chose to remove them intergrally. Then we remove duplicates since we don't want to consider the same information several times. Our analysis is based on food inspection, inspection rows with Out of Business or Business Not Located inspection results were also removed since no inspection happened. Finally, all restaurants should have a unique license number and since we aim to use that for merging etablishments, we get rid of the inpections for which the license number was not specified.

In [6]:
# Drop columns that only contain NaNs
cfi.drop(columns = ['Historical Wards 2003-2015', 'Zip Codes', 'Community Areas', 'Census Tracts', 'Wards'], \
         inplace = True)

# Drop duplicate rows (121 duplicate rows)
cfi.drop_duplicates() 

# Remove inspection for which buisnesses were closed or not found
cfi = cfi[(cfi['Results'] != 'Out of Business') & (cfi['Results'] != 'Business Not Located')]

# Remove businesses without any license number
cfi = cfi[(cfi['License #'] != 0)]

The non-empty columns of the dataset are the following:

DBA Name - "Doing Business As", this is the legal name of the business.
AKA Name - The name that the public knows the establishment as.
License Number - This is a unique number for each establishment.
Facility Type - the type of establishment (ex. grocery store, bakery, restaurant, etc.).
Risk - Each establishment has a different risk of adversely affecting the public health. Risk 1 is the highest and Risk 3 is the lowest. Lower risk establishments have inspections less often.
Address - The address of the establishment.
City - The city of the establishement.
State - The state of the establishement, in a two letter format.
Inspection Date - The date of the inspection.
Inspection Type - The type of inspection: Canvass (the most common, based on the risk of the establishment), Consultation (at the request of the owner before the opening of the establishment), Complaint (the inspection is done following a complaint on the establishment), License (inspection so the establishment can receive its license to operate), Suspect food poisoning (inspection done in response to people claiming getting ill after eating at the establishment), Task-force inspection (inspection of a Tavern or Bar), Re-inspections (can occur for many types).
Result - The inspection can be a pass, pass w/ conditions, or a fail.
Violations - An establishment can receive one or more of the 45 different violations. The requirement of the establishment to not receive the violation is also noted.

Looking at the empty values in the dataframe

In [7]:
# Count the number of null values in each of the columns of the DataFrame 
cfi.isnull().sum()
Out[7]:
Inspection ID          0
DBA Name               0
AKA Name            1859
License #             16
Facility Type        802
Risk                  41
Address                0
City                 109
State                 40
Zip                   32
Inspection Date        0
Inspection Type        1
Results                0
Violations         34713
Latitude             637
Longitude            637
Location             637
dtype: int64

From the results above, the following will be removed:

  • Risk: the 61 missing entries have to be discarded
  • Facility Type (see next section: Facility inspection, the type will try to be retrieved from the DBA Name)
  • Latitude and Longitude will have to be dropped since we will use it for the comparison with the Crime Dataset (see the Locations inspection section)
In [8]:
# Removing the null values in the Risk column 
cfi.dropna(subset=['Risk'], inplace = True)

1.2 Facility inspection

At this step, we want to remove all facility types which do not sell and/or produce their own food. That is, we want to exclude grocery stores, gas stations, schools, etc. Furthermore, we separate them in 9 different categories depending on the type of service they offer. To do so, we use REGEX to search for the corresponding key words.

In [9]:
# Initializing the regex 
REST = """"rest|diner|sushi|taqueria|hot.dog|grill|cuisine|jap.nese|sandwich|bbq|pizza|burrito
|mcdo|mc do|kimchee|chicken|pollo|fish|pasta|taco|rice|fast.food|bagel|noodle|sub|pita|kebab|kabob
|seafood|shrimp|thai|african|chinese|burger|porkchop|wendy"""
BAR = "bar|hooka|bistro|brewery"
TAVERN = "tavern"
COFFEE = "cafe|caffe|coffee"
DESSERT = "gelato|ice.cream|paleteria|dessert|cand.|chocolate|cookies|cake|patisserie|sweet"
BAKERY = "bakery|donut"
DRINKS = "shake|juice|smoothie|tea"
SNACK = "snack|snak|popcorn"
DELI = "deli"
regexes = [REST, BAR, TAVERN, COFFEE, DESSERT, BAKERY, DRINKS, SNACK, DELI]
types = ['Restaurant', 'Bar', 'Tavern', 'Coffee', 'Dessert', 'Bakery', 'Drinks', 'Snaks', 'Deli']

# Get rows with no facility type
no_facility = cfi['Facility Type'].isnull() 

# Change the facility type according to the regexes
for i, regex in enumerate(regexes):
    facility = cfi['Facility Type'].str.contains(pat = regex, case=False, regex=True, na=False)
    dba = cfi['DBA Name'].str.contains(pat = regex, case = False, regex = True, na=False)
    cfi.loc[facility|(no_facility & dba), 'Facility Type'] = types[i]
    
# Drop rows with no and not of interest facility type
cfi.dropna(axis = 0, subset = ['Facility Type'], inplace = True)
index_names = cfi[~cfi['Facility Type'].isin(types)].index
cfi.drop(index_names, inplace = True)

# Show the edited Facility column
cfi.sample(2)
Out[9]:
Inspection ID DBA Name AKA Name License # Facility Type Risk Address City State Zip Inspection Date Inspection Type Results Violations Latitude Longitude Location
148487 1229983 KENTUCKY FRIED CHICKEN 521035 KENTUCKY FRIED CHICKEN 521035 1816800.0 Restaurant Risk 1 (High) 5852 S WESTERN AVE CHICAGO IL 60636.0 2012-07-05T00:00:00.000 Complaint Fail 3. POTENTIALLY HAZARDOUS FOOD MEETS TEMPERATUR... 41.786788 -87.683985 {'longitude': '41.786788085900675', 'latitude'...
126100 1092569 TRIPOLI TAVERN TRIPOLI TAVERN 1403566.0 Restaurant Risk 1 (High) 1147 W ARMITAGE AVE CHICAGO IL 60614.0 2013-08-22T00:00:00.000 Canvass Pass 30. FOOD IN ORIGINAL CONTAINER, PROPERLY LABEL... 41.917939 -87.657224 {'longitude': '41.917939272072275', 'latitude'...

1.3 Locations inspection

There are no rows that don't have one of either latitude, longitude and address. The address, however, is insufficient, and thus all samples without latitude & longitude will be removed.

In [10]:
# Drop location with missing latitude or longitude
no_location = cfi[cfi['Latitude'].isnull() | cfi['Longitude'].isnull()].index
cfi.drop(no_location, inplace = True)

# Plot the locations
sns.jointplot(x=cfi['Latitude'], y=cfi['Longitude'], kind='hex')
plt.show()

This scatter plot shows that all of the data does seem to really be in Chicago or in the surrounding area.

1.4 Adjusting columns

Several columns have to be processed, changed or added for our subsequent analyses.

1.4.1 The Risk column

Simplify the risk column to a risk with an integer value from 1-3.

In [11]:
print('The unique values in the Risk column are: ', cfi['Risk'].unique())

# Converting the Risk column to int values varying from 1-3 (Risk 1 = 1 etc.)
def risk_to_int(string):
    switcher = { 
        'Risk 1 (High)' or 'All': 1, 
        'Risk 2 (Medium)': 2, 
        'Risk 3 (Low)': 3, 
    }
    return switcher.get(string, 0) 
The unique values in the Risk column are:  ['Risk 3 (Low)' 'Risk 1 (High)' 'Risk 2 (Medium)' 'All']
In [12]:
# Converting the risk columns to int values
cfi['Risk']=cfi['Risk'].apply(risk_to_int)

1.4.2 The Inspection Date column

Use the date formate in the Inspection Date column for future use.

In [13]:
# Convert Inspection date in date format
cfi['Inspection Date'] = cfi['Inspection Date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%dT%H:%M:%S.%f'))

1.4.3 The unused columns: AKA Name and Location

Drop the columns that are of no use in our analyses.

In [14]:
# Drop unused columns
cfi.drop(columns = ['AKA Name', 'Location'], inplace = True)

1.4.4 Add a column: Community Area Code

This will be useful when working together with the crimes dataset to match the information into geographical groups (i.e. the community areas identified by a unique code from 1 to 77).

In [15]:
# create a new column storing the Point necessary for polygon testing later on
cfi['Location Point'] = cfi.apply(lambda x: Point(x.Longitude, x.Latitude), axis=1)
In [16]:
# define the polygons and corresponding area codes for all areas 
polygons_list = []
areacode_list = []
for feature in geo_json_data['features']:
    polygons_list.append(shape(feature['geometry']))
    areacode_list.append(int(feature['properties']['area_num_1']))
In [17]:
def find_area(point):
    for idx, polygon in enumerate(polygons_list):
        if polygon.contains(point):
            return areacode_list[idx]
    return np.nan
In [18]:
cfi['Community Area'] = cfi['Location Point'].apply(lambda x: find_area(x))

Let's verify that all inspections are attributed to a community area:

In [19]:
cfi['Community Area'].isnull().sum()
Out[19]:
2398

There are over 2000 inspections that could not be attributed to a community area. Let's inspect them to understand why this happened and fix the values.

In [20]:
# Find the different locations that could not be attributed
print(cfi[cfi['Community Area'].isnull()]['Latitude'].unique())
print(cfi[cfi['Community Area'].isnull()]['Longitude'].unique())
[42.0085364  41.89224916 41.89233781]
[-87.91442844 -87.60951805 -87.60404476]
In [130]:
# plot these location in the map together with the boundaries of the community areas
m = folium.Map(location=[41.8781, -87.6298])

# add the boundaries to the map
folium.GeoJson(geo_json_data).add_to(m)

folium.Marker([42.0085364, -87.91442844]).add_to(m)
folium.Marker([41.89224916, -87.60951805]).add_to(m)
folium.Marker([41.89233781, -87.60404476]).add_to(m)

m.save('data/html/nan_analysis.html')

#m

By visual inspection of the plot found here, we can see that these three locations are slightly out of the official area as defined by the geojson data, but they still clearly belong to Chicago, so we will manually add their corresponding community area to the data.

In [22]:
# retrieve the indices corresponding to the nan values
idx_nan = cfi[cfi['Community Area'].isnull()].index

# retrieve the full values of the latitude to allow comparison (values yielded by unique() not sufficient)
lat_collection = []
for i in idx_nan:
    lat = cfi['Latitude'][i]
    if lat not in lat_collection:
        print(lat)
        lat_collection.append(lat)
42.008536400868735
41.892249163400116
41.89233780863412

The first latitude value corresponds to area 76, and the other two are in area 8. Let's hence add the missing values to the dataset.

In [23]:
# add the missing area codes
cfi.loc[cfi['Community Area'].isnull(), 'Community Area'] = \
cfi[cfi['Community Area'].isnull()]['Latitude'].apply(lambda x: 76 if (x == lat_collection[0]) else 8)

# convert the codes to ints now that no nan values left
cfi['Community Area'] = cfi['Community Area'].astype(int)

Check there are no NaN values left:

In [24]:
cfi['Community Area'].isnull().sum()
Out[24]:
0

The Location Point can now be dropped again.

In [25]:
cfi.drop(columns=['Location Point'], inplace = True)

1.4.5 Add a column: Food chain

Now let's make a dataset grouping the food chains. According to Wikipedia, the top 15 biggest food chains of the US are: Subway, McDonald's, Starbucks, KFC, Burger King, Pizza Hut, Domino's, Dunkin', Baskin-Robbins, Hunt Brothers Pizza, Taco Bell, Wendy's, Hardee's, Orange Julius and Papa John's Pizza.

In [26]:
def get_chain_column(x, to_compare, names, cutoff):
    clean_x = re.sub('[\W\d]','', x).lower()
    match = get_close_matches(clean_x, to_compare, n=1, cutoff=cutoff)
    if(match):
        return names[chains_clean.index(match[0])]
    else:
        return 'None'
In [27]:
chains_names = ['Subway', 'McDonald\'s', 'Starbucks', 'KFC', 'Burger King', 'Pizza Hut', 'Domino\'s',\
                 'Dunkin\'', 'Baskin-Robbins', 'Hunt Brothers Pizza', 'Taco Bell', 'Wendy\'s',\
                 'Hardee\'s', 'Orange Julius', 'Papa John\'s Pizza']
chains_clean = [re.sub('[\W\d]','', x).lower() for x in chains_names]
cutoff = 0.9

cfi['Chain'] = cfi['DBA Name'].apply(get_chain_column, args=(chains_clean, chains_names, cutoff))

1.4.6 Adapating the violations column

From 2010 to 7/1/2018:

Given that there are 45 violations possible from a food inspection, and that these violations fall into three categories:

  • Critical: violations number 1-14
  • Serious: violations number 15-29
  • Minor: violations number 30-45

We can make a final score for each inspection based on the category of the violation. Since an inspection can only lead to a Fail result when serious or critical violations have been delivered, we will count the number of these types of violations made by each inspection. This yields the Violation Score.

From 7/1/2018 to present:

The violations increased from 45 to 63 and the terms used to classify the different violations has also changed.

  • "Critical Violation" has been changed to “Priority (P) Violation"
  • "Serious Violation" has been changed to "Priority Foundation (PF) Violation"
  • "Minor Violation" has been changed to "Core (C) Violation"

Unfortunately, the violations which correspond to these new categories are not as clearly defined as the old violation system (critical, serious, minor). Thus, using the criteria which define each type of category of violation (available online), we manually categorized each violation. See: https://www.chicago.gov/city/en/depts/cdph/provdrs/healthy_restaurants/svcs/understand_healthcoderequirementsforfoodestablishments.html

1.4.6.1 Separating the new and old violations code

We will begin by doing the analysis on inspections done before 7/1/2018. For this we will make a new column: 'Violation Score Old', where 'old' is for the old violation code (before 7/1/2018), and 'new' is for the new violation code (after 7/1/2018).

In [28]:
# Making new dataframes to separate the old and new system
cfi_VOld = cfi[cfi['Inspection Date'] < '2018-07-01'].copy()
cfi_VNew = cfi[cfi['Inspection Date'] >= '2018-07-01'].copy()
In [29]:
# Old version of the violation code

# Need to do two searches for the violation number.

# The first search: finding the first number (without a ' | ' separation):
CRITICAL_first = [str(i) + '. ' for i in range(1,10)]
CRITICAL_first2 = [str(i) + '. ' for i in range(10,15)] # Longer string values (10-15) need to be separate for search
SERIOUS_first = [str(i) + '. ' for i in range(15,30)]

# The second search: the other numbers which come with a ' | ' separation before
# (note all the violation numbers are followed by a '. ' which helps in the search)

CRITICAL = "\| 1\. |\| 2\. |\| 3\. |\| 4\. |\| 5\. |\| 6\. |\| 7\. |\| 9\. |\| 10\. |\| 11\. |\| 12\. |\| 13\. |\| 14\. "
SERIOUS = "\| 15\. |\| 16\. |\| 17\. |\| 18\. |\| 19\. |\| 20\. |\| 21\. |\| 23\. |\| 24\. |\| 25\. |\| 26\. |\| 27\. |\| 28\. |\| 29\. "

# The first search: finding the first numbers
cfi_VOld['Score Violation First CRITICAL'] = np.where(cfi_VOld['Violations'].str[0:3].isin(CRITICAL_first) \
                                             | cfi_VOld['Violations'].str[0:4].isin(CRITICAL_first2), 1, 0)
cfi_VOld['Score Violation First SERIOUS'] = np.where(cfi_VOld['Violations'].str[0:4].isin(SERIOUS_first), 1, 0)

# The second search: finding all subsequent numbers with the ' | ' separation
cfi_VOld['Score Violation Others'] = cfi_VOld['Violations'].str.count(CRITICAL)\
                                + cfi_VOld['Violations'].str.count(SERIOUS)

# Adding all the columns to get the final score 
cfi_VOld['Score Violation Old'] = cfi_VOld['Score Violation First CRITICAL'] + \
                                cfi_VOld['Score Violation First SERIOUS'] + \
                                cfi_VOld['Score Violation Others']

# Dropping the other columns
cfi_VOld.drop(columns = ['Score Violation First CRITICAL', 'Score Violation First SERIOUS', \
                            'Score Violation Others'], inplace = True)

cfi_VOld.head(1)
Out[29]:
Inspection ID DBA Name License # Facility Type Risk Address City State Zip Inspection Date Inspection Type Results Violations Latitude Longitude Community Area Chain Score Violation Old
21900 2182172 PEPE'S RESTAURANT 7056.0 Restaurant 1 7026 W ARCHER AVE CHICAGO IL 60638.0 2018-06-29 Complaint Pass 32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL... 41.79214 -87.797669 56 None 0.0

Let's have a quick look at the distribution of the scores based on the old scoring system:

In [30]:
cfi_VOld['Score Violation Old'].hist(bins = 10)
plt.title('Histogram of Violation score', Fontsize = 20)
plt.ylabel('Number of inspections')
plt.xlabel('Violation Score')
Out[30]:
Text(0.5, 0, 'Violation Score')
1.4.6.2 The new scoring system

As stated before, we need to manipulate the new violation code to make it work with the old system. Thus, using the criteria which define each type of category of violation (available online), we manually categorized each violation.

In [31]:
# Priority Violation
PV = "\| 6\. |\| 7\. |\| 8\. |\| 9\. |\| 10\. |\| 11\. |\| 12\. |\| 13\. |\| 14\. |\| 18\. |\| 19\. |\| 20\. |\| 21\. |\| 23\. |\| 24\. |\| 26\. |\| 27\. |\| 28\. |\| 30\. |\| 31\. |\| 32\. |\| 33\. |\| 36\. |\| 38\. |\| 40\. |\| 42\. |\| 46\. |\| 50\. |\| 51\. |\| 52\. |\| 53\. |\| 54\. |\| 55\."
# Priority Foundation Violation
PFV = "\| 1\. |\| 2\. |\| 3\. |\| 4\. |\| 5\. |\| 15\. |\| 16\. |\| 17\. |\| 34\. |\| 35\. |\| 37\. |\| 39\. |\| 25\. |\| 26\. |\| 27\. |\| 28\. |\| 29\. |\| 56\. |\| 57\. |\| 58\. "

PV_first = ['6. ', '7. ', '8. ','9. ']
PV_first2 = ['10. ', '11. ', '12. ', '13. ', '14. ', '18. ', '19. ', '20. ', '21. ', '23. ', '24. ', '26. ', '27. ', '28. ', '30. ', '31. ', '32. ', '33. ', '36. ', '38. ', '40. ','42. ', '46. ', '50. ', '51. ', '52. ','53. ','54. ','55. ']
PFV_first = ['1. ','2. ','3. ','4. ','5. ']
PFV_first2 = ['15. ','16. ',' 17. ',' 34. ','35. ', '37. ', '39. ','25. ', '26. ','27. ','28. ','29. ','56. ','57. ','58. ']

cfi_VNew['Score Violation First PV'] = np.where(cfi_VNew['Violations'].str[0:3].isin(PV_first) \
                                             | cfi_VNew['Violations'].str[0:4].isin(PV_first2), 1, 0)
cfi_VNew['Score Violation First PFV'] = np.where(cfi_VNew['Violations'].str[0:3].isin(PFV_first) \
                                             | cfi_VNew['Violations'].str[0:4].isin(PFV_first2), 1, 0)

cfi_VNew['Score Violation Others'] = cfi_VNew['Violations'].str.count(PV)\
                                + cfi_VNew['Violations'].str.count(PFV)

cfi_VNew['Score Violation Recent'] = cfi_VNew['Score Violation First PV'] + \
                                cfi_VNew['Score Violation First PFV'] + \
                                cfi_VNew['Score Violation Others']

cfi_VNew.drop(columns = ['Score Violation First PV', 'Score Violation First PFV', \
                            'Score Violation Others'], inplace = True)

cfi_VNew.head(1)
Out[31]:
Inspection ID DBA Name License # Facility Type Risk Address City State Zip Inspection Date Inspection Type Results Violations Latitude Longitude Community Area Chain Score Violation Recent
1 2320248 THE REDHEAD PIANO BAR 2313942.0 Restaurant 3 16-18 W ONTARIO ST CHICAGO IL 60654.0 2019-10-22 License Pass w/ Conditions NaN 41.893371 -87.628783 8 None NaN
1.4.6.3 Score distributions

Let's analyze now the score distribution of the new scoring system:

In [32]:
cfi_VNew['Score Violation Recent'].hist(bins = 25)
plt.title('Histogram of Violation score', Fontsize = 20)
plt.ylabel('Number of inspections')
plt.xlabel('Violation Score')
Out[32]:
Text(0.5, 0, 'Violation Score')
In [33]:
cfi_VOld['Score Violation Old'].describe()
Out[33]:
count    89868.000000
mean         0.648629
std          1.057094
min          0.000000
25%          0.000000
50%          0.000000
75%          1.000000
max         10.000000
Name: Score Violation Old, dtype: float64
In [34]:
cfi_VNew['Score Violation Recent'].describe()
Out[34]:
count    11893.000000
mean         5.707475
std          3.667515
min          0.000000
25%          3.000000
50%          5.000000
75%          8.000000
max         27.000000
Name: Score Violation Recent, dtype: float64
In [35]:
cfi['Score Violation Old'] = cfi_VOld['Score Violation Old']
cfi['Score Violation Recent'] = cfi_VNew['Score Violation Recent']

# Replacing the Nan values by zeros, since it means there were no violations
cfi['Score Violation Old'] = cfi['Score Violation Old'].fillna(0)
cfi['Score Violation Recent'] = cfi['Score Violation Recent'].fillna(0)

# Summing the old and the new: to get the full results
cfi['Score Violation'] = cfi['Score Violation Recent'] + cfi['Score Violation Old'] 

# Food Inspection Score: for further transformations where higher score would represent a better result
cfi['Food Inspection Score'] = cfi['Score Violation Recent'] + cfi['Score Violation Old']

cfi.drop(columns = ['Score Violation Recent', 'Score Violation Old'], inplace = True)
1.4.6.4 Creating a common violation scoring system

As seen above, the distribution of the new and the old violation scores are very different . This is why we will:

  • Do a normalization of the violation scores (number of critical + serious violations) separately for the old and the new violation system. The main motivation to do this is to allow the transition from the old to the new violation system since the new system has more violations (yielding large scores compared to old scores). We are assuming that the number of violations in the new system would not have changed if it continued to follow the old system.
  • Do a min-max scaling in order to have a clear scale 0-10 for the violation score.
  • Invert the distribution by doing (max - old_value): that way a higher score means a better result, which is coherent with the Yelp score. Instead of a Violation Score we now have a Food Inspection Score that reflects the safety of the establishment.

Thus a Food Inspection Score of 10 means good food safety, and a score of 0 is a very low food safety.

In [36]:
date = pd.to_datetime('2018/7/1',infer_datetime_format=True)

# Normalize by the sum of all violations during the period (old/new) and multiply by number of days
sum_ = cfi['Score Violation'][cfi['Inspection Date'] < date].sum() / (date - cfi['Inspection Date'].min()).days
cfi['Food Inspection Score'][cfi['Inspection Date'] < date] = cfi['Food Inspection Score'][cfi['Inspection Date'] < date]/sum_
sum_ = cfi['Score Violation'][cfi['Inspection Date'] >= date].sum() / (cfi['Inspection Date'].max() - date).days
cfi['Food Inspection Score'][cfi['Inspection Date'] >= date] = cfi['Food Inspection Score'][cfi['Inspection Date'] >= date]/sum_ 
In [37]:
# Scale the Food Inspection Score to get a score between 0-10.
# Do this separately for the new and old system
X = cfi['Food Inspection Score'][cfi['Inspection Date'] < date]
cfi['Food Inspection Score'][cfi['Inspection Date'] < date] = 10*(X-min(X))/(max(X)-min(X))
X = cfi['Food Inspection Score'][cfi['Inspection Date'] >= date]
cfi['Food Inspection Score'][cfi['Inspection Date'] >= date] = 10*(X-min(X))/(max(X)-min(X))
In [38]:
# Change the distribution so that small is worse
cfi['Food Inspection Score'] = cfi['Food Inspection Score'].max() - cfi['Food Inspection Score'] 
In [39]:
cfi['Food Inspection Score'].hist(bins = 15)
plt.title('Histogram of Food Inspection Scores', Fontsize = 20)
plt.ylabel('Number of inspections')
plt.xlabel('Food inspection score')
Out[39]:
Text(0.5, 0, 'Food inspection score')
1.4.6.5 Making a smoothed Food Safety Score

Since the Yelp data is valid as of today, we will want a Food Inspection Score that largely bases itself on the recent inspection outcomes. We also want to take into account the outcome of the previous inspections since they can also reflect the current Food Safety of the establishment, but just with less importance. Thus the smoothed score should take into account the old scores just with less weighting, ensuring that the score really takes into account the recent inspections, but does not ignore the past.

This is why we will use an Exponential Average to calculate a Smoothed Food Inspection Score that will be added as to the DataFrame as a column.

In [40]:
# The function used to calculate the exponential average 
lambda_ = 0.94

def EWMA(arr):
    avg = arr[0]
    t = len(arr)
    if t==1:
        return avg
    for i in range(1,t):
        avg = lambda_*avg + (1-lambda_)*arr[i]
    return avg
In [41]:
# More recent inspections have more weight on the score in 2019
# This will be useful for our final results comparing with the Yelp data

cfi_scores = cfi.copy(deep = True)
cfi_scores.index = pd.to_datetime(cfi_scores.pop('Inspection Date'))
cfi_scores_smoothed = cfi_scores.groupby(['License #'])['Food Inspection Score'].apply(EWMA)
In [42]:
cfi_scores_smoothed=pd.DataFrame(cfi_scores_smoothed)
cfi_scores_smoothed.columns = ['Smoothed Food Inspection Score']
cfi = cfi.merge(pd.DataFrame(cfi_scores_smoothed), on = 'License #')
In [43]:
cfi['Smoothed Food Inspection Score'].hist(bins = 25)
plt.title('Histogram of Smoothed Food Inspection Scores', Fontsize = 20)
plt.ylabel('Number of inspections')
plt.xlabel('Food Inspection Score Smoothed')
Out[43]:
Text(0.5, 0, 'Food Inspection Score Smoothed')

1.5 Group establishments for an establishment-centric dataframe

For the rest of the analysis, it will be handy to also have a dataframe, which is establishment-centric. This means that we regroup all inspections for every individual establishment.

In [44]:
# Generate groups according to license number
groups = cfi.groupby(['License #'])

# Aggregates other columns either in a list format or by taking the mode
columns = (list(cfi.columns))
columns.remove('License #')
columns_to_list = ['Inspection ID', 'Inspection Date', 'Inspection Type', 'Results', 'Violations', 'Score Violation', 'Food Inspection Score']

for i, col in enumerate(columns):
    if(col not in columns_to_list): # Take the mode value
        df = groups[col].agg(pd.Series.mode).reset_index(name=col)
    else: # Aggregate the resluts in a list
        df = groups[col].apply(list).reset_index(name=col)
        
    if(i==0):
        cfi_groups = df
    else:
        cfi_groups = cfi_groups.merge(df, on = ['License #'])

# The results of the transformations
cfi_groups.head(1)
Out[44]:
License # Inspection ID DBA Name Facility Type Risk Address City State Zip Inspection Date Inspection Type Results Violations Latitude Longitude Community Area Chain Score Violation Food Inspection Score Smoothed Food Inspection Score
0 2.0 [2144871, 2050308, 1977093, 1970902, 1970312, ... COSI Restaurant 1 230 W MONROE ST CHICAGO IL 60606 [2018-02-13 00:00:00, 2017-05-12 00:00:00, 201... [Canvass, Canvass, Short Form Complaint, Canva... [Pass w/ Conditions, Pass, Pass w/ Conditions,... [3. POTENTIALLY HAZARDOUS FOOD MEETS TEMPERATU... 41.8808 -87.6347 32 None [1.0, 0.0, 2.0, 0.0, 3.0, 1.0, 0.0, 0.0, 0.0, ... [9.0, 10.0, 8.0, 10.0, 7.0, 9.0, 10.0, 10.0, 1... 9.33802

1.6 Final datasets

The establishments need to still be open today

Establishments must be visited at least every three years (for the lowest risk establishments, more for the higher risks). Thus, since we will be comparing our data with the Yelp dataset, we will remove the establishments that have not been inspected in the last three years. We assume that this means the establishment has shut down and thus a comparison with the current data in Yelp would be nonsensical.

First we will apply these transformations to the establishment-centric dataframe:

In [45]:
# Generate the new dataset with places that have been inspected at least once in the last three years 
print('Size CFI establishment-centric dataframe:', cfi_groups.shape[0])

# For the dataset with each row containing one establishment
cfi_groups['DBA Name'] = cfi_groups['DBA Name'].apply(lambda x: str(x).lower())
cfi_groups.loc[:, 'Last Inspection Year'] = cfi_groups['Inspection Date'].map(lambda x: x[0].year)
cfi_recent = cfi_groups[cfi_groups['Last Inspection Year'] > 2015].reset_index()
print('Size of CFI establishment-centric dataframe after dropping non-inspected places:', cfi_recent.shape[0])
Size CFI establishment-centric dataframe: 20316
Size of CFI establishment-centric dataframe after dropping non-inspected places: 12863

Doing the same for the inspection-centric dataframe:

In [46]:
# Get the inspection year
cfi['Inspection Year'] = cfi['Inspection Date'].map(lambda x: x.year)

Saving the cleaned datasets

In [47]:
cfi_insp = cfi.copy()

# Save the cleaned dataset: one row is one establishment
cfi_recent.to_csv('data/cfi_recent.csv')

# Save the cleaned dataset: each row if one inspection
cfi_insp.to_csv('data/cfi_insp.csv')

2. Yelp Reviews (YR) dataset

The data scraped from Yelp has been directly added on the clean version of our primary CFI dataset by adding 5 columns: the license number, the Yelp name of the business, the number of reviews, the rating and the associated food categories. We added the Yelp name just to see if the business obtained was the one that we were looking for. Thus, we will have for every restaurants, the rating of the business (which is computed as the mean of all the reviews) and the total number of reviews. For the format of these columns, the rating is a float, and the number of reviews is an int. Note that the rating only takes some finite values (from 0 to 5, with increments of 0.5).

Of course, we weren't able to obtain information on every restaurant. For instance, some of them might not have a Yelp page, other restaurants might have shutdown in the meantime and thus they don't have a Yelp page anymore, and finally in some cases, we might just have issues with the querying. On average, we were able to obtain information on roughly 75% of our dataset, which is a nice proportion taking into account all of the issues mentionned above. Note that by default the business for which we didn't have any Yelp information have the Yelp columns filled with NaN values.

2.1 Merge with CFI dataset

Note, to be able to do the scraping, we used the clean dataset called "cfi_recent". Indeed, in this dataset, each row represents one restaurant, and thus, we don't query twice for the same restaurant (if we would have used the inspection-centric dataset). Moreover, we won't have to do any cleaning on this dataset, as it is simply added on this previously cleaned dataset. The information obtained from Yelp is already in the right format. We might just, as stated above, have some lines with NaN values in our merged dataset (CFI + Yelp).

In [48]:
#Load the data scraped
data_scrapped = pd.read_csv('data/food-inspections_scrapped.csv', usecols=['License #','Yelp name','Yelp review count','Yelp rating','Yelp category'])
In [49]:
#Add the three Yelp columns on the cfi_recent dataframe
cfi_recent = cfi_recent.merge(data_scrapped, on='License #')
In [50]:
cfi_recent.head(1)
Out[50]:
index License # Inspection ID DBA Name Facility Type Risk Address City State Zip ... Community Area Chain Score Violation Food Inspection Score Smoothed Food Inspection Score Last Inspection Year Yelp name Yelp review count Yelp rating Yelp category
0 0 2.0 [2144871, 2050308, 1977093, 1970902, 1970312, ... cosi Restaurant 1 230 W MONROE ST CHICAGO IL 60606 ... 32 None [1.0, 0.0, 2.0, 0.0, 3.0, 1.0, 0.0, 0.0, 0.0, ... [9.0, 10.0, 8.0, 10.0, 7.0, 9.0, 10.0, 10.0, 1... 9.33802 2018 Cosi 70.0 3.0 [{'alias': 'sandwiches', 'title': 'Sandwiches'...

1 rows × 26 columns

2.2 Clean food categories

The food category labels of the establishments obtained by Yelp are not standardized and an incredible amount of different labels exists. Therefore, we regrouped the labels in 14 different cuisine categories to allow for better interpretability such that people can search for their favorite type of food.

In [51]:
def get_list(x):
    list_dict = eval(x)
    if(list_dict):
        list_ = []
        for d in list_dict:
            list_.append(d['alias'].replace(' ',''))
        return list_
    else: 
        return []
        
cfi_recent = cfi_recent.dropna(subset=['Yelp category'])
cfi_recent['Yelp category'] = cfi_recent['Yelp category'].map(get_list)
In [52]:
exploded = cfi_recent.explode(column='Yelp category')
In [53]:
american = ['tradamerican','newamerican','cajun','southern','poke','hawaiian','bbq','steak','diners','meats','cheesesteaks']
fast_food = ['hotdogs','burgers','chicken_wings','sandwiches','hotdog','bagels','foodstands','food_court',\
            'wraps','foodtrucks','chickenshop']
italian = ['pizza','italian','pastashops','sicilian']
bar = ['bars','cocktailbars','sportsbars','pubs','wine_bars','beerbar','breweries','divebars',\
       'beer_and_wine','gastropubs','irish_pubs','brewpubs','beergardens','whiskeybars','brasseries','cideries'\
      'tikibars','wineries','pianobars','champagne_bars','lounges']
seafood = ['seafood','seafoodmarkets']
south_american = ['mexican','tacos','tex-mex','empanadas','newmexican','caribbean','cuban','puertorican',\
                 'salvadoran','argentine','colombian','acaibowls','peruvian','brazilian','honduran','dominican',\
                 'venezuelan','haitian']
asian = ['chinese','sushi','thai','japanese','asianfusion','indpak','korean','vietnamese','pakistani','cantonese',\
         'szechuan','filipino','dimsum','izakaya','ramen','noodles','himalayan','taiwanese','mongolian','malaysian',\
        'panasian','shanghainese','burmese','teppanyaki','indonesian','bangladeshi']
sweet = ['bakeries','icecream','desserts','donuts','bubbletea','cupcakes','customcakes','gelato', 'chocolate',\
        'candy','waffles','popcorn','macarons','shavedice','cakeshop','pancakes']
coffee = ['coffee','cafes','coffeeroasteries','tea','juicebars']
europeean = ['mediterranean','greek','french','tapas','tapasmallplates','portuguese','spanish','creperies',\
            'german','latin','fishnchips','modern_european','irish','polish','ukrainian','british','scandinavian',\
            'czech','fondue','austrian','slovakian','belgian','scottish','delis']
mideastern = ['mideastern','turkish','falafel','persian','lebanese','uzbek','kebab','armenian','halal']
african = ['african','ethiopian','southafrican','moroccan','senegalese','somali','arabian']
special_diet = ['vegetarian','vegan','soup','salad','gluten_free']
breakfast_brunch = ['breakfast_brunch']

categories = [american, fast_food, italian, bar, seafood, south_american, asian, sweet, coffee, europeean, mideastern,\
             african, special_diet, breakfast_brunch]
categories_names = ['american', 'fast_food', 'italian', 'bar', 'seafood', 'south_american', 'asian', 'sweet', 'coffee',\
                    'europeean', 'mideastern', 'african', 'special_diet', 'breakfast_brunch']
In [54]:
def get_category(x, categories, categories_names):
    list_ = [];
    for name in x:
        for j, cat in enumerate(categories):
            cat_name = categories_names[j]
            if(name in cat and not cat_name in list_):
                list_.append(cat_name)
                break
    return list_
In [55]:
cfi_recent['Yelp category'] = cfi_recent['Yelp category'].apply(get_category, args = (categories, categories_names))
In [56]:
cfi_recent.head(1)
Out[56]:
index License # Inspection ID DBA Name Facility Type Risk Address City State Zip ... Community Area Chain Score Violation Food Inspection Score Smoothed Food Inspection Score Last Inspection Year Yelp name Yelp review count Yelp rating Yelp category
0 0 2.0 [2144871, 2050308, 1977093, 1970902, 1970312, ... cosi Restaurant 1 230 W MONROE ST CHICAGO IL 60606 ... 32 None [1.0, 0.0, 2.0, 0.0, 3.0, 1.0, 0.0, 0.0, 0.0, ... [9.0, 10.0, 8.0, 10.0, 7.0, 9.0, 10.0, 10.0, 1... 9.33802 2018 Cosi 70.0 3.0 [fast_food, american, coffee]

1 rows × 26 columns

In [57]:
cfi_recent.to_csv('data/final_cfi_recent.csv')

3. Crimes dataset

Let's now go on with the preprocessing of the last dataset.

3.1 Exploring columns

In [58]:
print(crimes.dtypes)
ID                              int64
Case Number                    object
Date                           object
Block                          object
IUCR                           object
Primary Type                   object
Description                    object
Location Description           object
Arrest                           bool
Domestic                         bool
Beat                            int64
District                      float64
Ward                          float64
Community Area                float64
FBI Code                       object
X Coordinate                  float64
Y Coordinate                  float64
Year                            int64
Updated On                     object
Latitude                      float64
Longitude                     float64
Location                       object
Historical Wards 2003-2015    float64
Zip Codes                     float64
Community Areas               float64
Census Tracts                 float64
Wards                         float64
Boundaries - ZIP Codes        float64
Police Districts              float64
Police Beats                  float64
dtype: object

The dataset is also available directly on Chicago's data portal, where it comes with a nice explanation of the columns that helps understand the meaning of all 30 features.

ID - Unique identifier for the record.
Case Number - The Chicago Police Department RD Number (Records Division Number), which is unique to the incident.
Date - Date when the incident occurred. this is sometimes a best estimate.
Block - The partially redacted address where the incident occurred, placing it on the same block as the actual address.
IUCR - The Illinois Unifrom Crime Reporting code. This is directly linked to the Primary Type and Description. See the list of IUCR codes at https://data.cityofchicago.org/d/c7ck-438e.
Primary Type - The primary description of the IUCR code.
Description - The secondary description of the IUCR code, a subcategory of the primary description.
Location Description - Description of the location where the incident occurred.
Arrest - Indicates whether an arrest was made.
Domestic - Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act.
Beat - Indicates the beat where the incident occurred. A beat is the smallest police geographic area – each beat has a dedicated police beat car. Three to five beats make up a police sector, and three sectors make up a police district. The Chicago Police Department has 22 police districts. See the beats at https://data.cityofchicago.org/d/aerh-rz74.
District - Indicates the police district where the incident occurred. See the districts at https://data.cityofchicago.org/d/fthy-xz3r.
Ward - The ward (City Council district) where the incident occurred. See the wards at https://data.cityofchicago.org/d/sp34-6z76.
Community Area - Indicates the community area where the incident occurred. Chicago has 77 community areas. See the community areas at https://data.cityofchicago.org/d/cauq-8yn6.
FBI Code - Indicates the crime classification as outlined in the FBI's National Incident-Based Reporting System (NIBRS). See the Chicago Police Department listing of these classifications at http://gis.chicagopolice.org/clearmap_crime_sums/crime_types.html.
X Coordinate - The x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block.
Y Coordinate - The y coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block.
Year - Year the incident occurred.
Updated On - Date and time the record was last updated.
Latitude - The latitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.
Longitude - The longitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.
Location - The location where the incident occurred in a format that allows for creation of maps and other geographic operations on this data portal. This location is shifted from the actual location for partial redaction but falls on the same block.

In addition, we can see based on their website's information that in fact only 22 columns instead of 30 contain valuable information. The remaining 8 columns in fact pertain to an older classification system that we don't need to take into account. Therefore, we drop these columns for our analysis.

In [59]:
# drop old columns
cols = ['Historical Wards 2003-2015','Zip Codes', 'Community Areas', 'Census Tracts', 
        'Wards', 'Boundaries - ZIP Codes','Police Districts', 'Police Beats']
crimes.drop(cols, axis=1, inplace=True)
In [60]:
with pd.option_context('display.max_columns', None): display(crimes.head(1))
ID Case Number Date Block IUCR Primary Type Description Location Description Arrest Domestic Beat District Ward Community Area FBI Code X Coordinate Y Coordinate Year Updated On Latitude Longitude Location
0 11034701 JA366925 01/01/2001 11:00:00 AM 016XX E 86TH PL 1153 DECEPTIVE PRACTICE FINANCIAL IDENTITY THEFT OVER $ 300 RESIDENCE False False 412 4.0 8.0 45.0 11 NaN NaN 2001 08/05/2017 03:50:08 PM NaN NaN NaN

3.2 Cleaning entries

As we have seen in the data collection part, the whole dataset includes almost 7M entries and 30 columns. However, only roughly 3M entries pertain to our study period (2010 until present) to match with the CFI dataset.

In [61]:
# Only use data from January 1st, 2010 onwards
crimes = crimes[crimes['Year'] > 2009]

Let's also have a look at any missing values for each feature:

In [62]:
crimes.isnull().sum()
Out[62]:
ID                          0
Case Number                 2
Date                        0
Block                       0
IUCR                        0
Primary Type                0
Description                 0
Location Description     5595
Arrest                      0
Domestic                    0
Beat                        0
District                    1
Ward                       59
Community Area            401
FBI Code                    0
X Coordinate            21028
Y Coordinate            21028
Year                        0
Updated On                  0
Latitude                21028
Longitude               21028
Location                21028
dtype: int64

Based on these results, we decide to drop all entries with a nan value in any feature except location description, as this will anyways not be part of our analysis.

In [63]:
nan_entries = crimes[['Case Number', 'District', 'Ward', 'Community Area', 'Latitude', 'Longitude']]\
                    .isnull().sum(axis=1)
print('There are {0} entries with at least one missing value in our targeted features'.format((nan_entries>0).sum()))
crimes = crimes[nan_entries<1]
print('After cleaning, there are {0} remaining samples'.format(crimes.shape[0]))
There are 21473 entries with at least one missing value in our targeted features
After cleaning, there are 2894407 remaining samples

3.3 Cleaning columns

Let's now turn our attention to the features in our dataset.

In [64]:
crimes.columns
Out[64]:
Index(['ID', 'Case Number', 'Date', 'Block', 'IUCR', 'Primary Type',
       'Description', 'Location Description', 'Arrest', 'Domestic', 'Beat',
       'District', 'Ward', 'Community Area', 'FBI Code', 'X Coordinate',
       'Y Coordinate', 'Year', 'Updated On', 'Latitude', 'Longitude',
       'Location'],
      dtype='object')

Some columns will not be relevant for our analysis and we hence drop them. The X and Y coordinate are not necessary given that we work with the longitude/latitude data and location is simply an aggregate of the longitude and latitude. Also, we are not really interested in the location description and the update dates of the log entries for the purposes of our analysis.

In [65]:
# drop unnecessary columns
cols = ['Location Description', 'X Coordinate', 'Y Coordinate', 'Updated On', 'Location']
crimes.drop(cols, axis=1, inplace=True)

Also, we make sure that the data type of our features matches with its meaning and that it allows easy usage.

In [66]:
# convert floats to integers
crimes[['District', 'Ward', 'Community Area']] = crimes[['District', 'Ward', 'Community Area']].astype(int)

# convert booleans to integers (in python, True = 1 and False = 0)
crimes[['Arrest', 'Domestic']] = crimes[['Arrest', 'Domestic']].astype(int)

# structure the date and time for better usability
crimes['Date'] = crimes['Date'].apply(lambda d: datetime.strptime(d, '%m/%d/%Y %H:%M:%S %p'))

3.4 Detecting outliers

3.4.1 Geographical data

Let's check the crimes distribution by location:

In [67]:
plt.figure(figsize=(6,6))
plt.scatter(crimes['Latitude'], crimes['Longitude'])
plt.show()

We realize that there must be some entries that do not correspond to the city of Chicago, whose latitude/longitude is 41.8781° N, 87.6298° W. Therefore, we are going to exclude datapoints below 40° latitude and -90° longitude:

In [68]:
in_chigaco = (crimes['Longitude'] > -90) & (crimes['Latitude'] > 40)
print('There are {0} entries with a location outside of Chicago'.format(crimes.shape[0]-in_chigaco.sum()))
crimes = crimes[in_chigaco]
There are 76 entries with a location outside of Chicago

Let's recheck whether the geographical distribution now makes sense:

In [69]:
sns.jointplot(x=crimes['Latitude'], y=crimes['Longitude'], kind='hex')
plt.show()

3.4.2 Community area data

It is known that Chicago has 77 community areas, numbered from 1 to 77. Let's therefore discard any values different from this range.

In [70]:
areas = np.arange(1,78)
in_areas = crimes['Community Area'].isin(areas)
print('There are {0} entries with a wrong area code'.format(crimes.shape[0]-in_areas.sum()))
crimes = crimes[in_areas]
There are 9 entries with a wrong area code

Also, Chicago has 25 police districts, numbered 1 to 25.

In [71]:
districts = np.arange(1,26)
in_districts = crimes['District'].isin(districts)
print('There are {0} entries with a wrong district code'.format(crimes.shape[0]-in_districts.sum()))
crimes = crimes[in_districts]
There are 57 entries with a wrong district code
In [72]:
crimes.to_csv('crimes_clean.csv')

C. Descriptive analysis

1. Evolution of food inspection results

  • How did the food safety in Chicago evolve since the year 2010? (Overall & geographically)
  • Does the same area/restaurants remain safe over time?
In [73]:
cfi_recent.head(1)
Out[73]:
index License # Inspection ID DBA Name Facility Type Risk Address City State Zip ... Community Area Chain Score Violation Food Inspection Score Smoothed Food Inspection Score Last Inspection Year Yelp name Yelp review count Yelp rating Yelp category
0 0 2.0 [2144871, 2050308, 1977093, 1970902, 1970312, ... cosi Restaurant 1 230 W MONROE ST CHICAGO IL 60606 ... 32 None [1.0, 0.0, 2.0, 0.0, 3.0, 1.0, 0.0, 0.0, 0.0, ... [9.0, 10.0, 8.0, 10.0, 7.0, 9.0, 10.0, 10.0, 1... 9.33802 2018 Cosi 70.0 3.0 [fast_food, american, coffee]

1 rows × 26 columns

1.1 Heatmap of the restaurants in Chicago

First, let's plot a density map of Chicago's restaurant to have a better view of the repartition of the restaurant in the city.

In [74]:
cfi_recent.drop(cfi_recent[[str(cfi_recent['Latitude'].iloc[i])[0]=='[' for i in range(cfi_recent.shape[0])]].index, inplace=True)

area_geo_city = r'data/geojson/city.geojson'
geo_json_chicago_city = json.load(open(area_geo_city))
data = cfi_recent[['Latitude', 'Longitude', 'Yelp rating']].dropna()

restaurant_density = heat_map_chicago(data, geo_json_chicago_city)
restaurant_density.save('data/html/density.html')
#restaurant_density

The map can be seen here.

1.2 Ratio of pass/fail over time

Next, we analyze the ratio of pass/fail outcomes over the years to detect changes in the overall sanitation of the Chicago's food establishments.

In [75]:
# Convert the 'No Entry' and the 'Not Ready' into NaN values in the dataframe
cfi_insp[(cfi_insp['Results'] == 'No Entry') | (cfi_insp['Results'] == 'Not Ready')] = np.nan

# For each year, compute the pass/fail ratio
years = [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]

year_results = pd.DataFrame(index=years, columns=['Pass', 'Pass w conditions', 'Fail'])

for year in years:
    year_results['Pass w conditions'][year] = cfi_insp[(cfi_insp['Inspection Year'] == year) & \
                                                       (cfi_insp['Results'] == 'Pass w/ Conditions')].shape[0]
    year_results['Pass'][year] = cfi_insp[(cfi_insp['Inspection Year'] == year) & \
                                                      (cfi_insp['Results'] == 'Pass')].shape[0]
    year_results['Fail'][year] = cfi_insp[(cfi_insp['Inspection Year'] == year) & \
                                                      (cfi_insp['Results'] == 'Fail')].shape[0]

year_results = year_results.div(year_results.sum(axis=1), axis=0)    
year_results.head()
Out[75]:
Pass Pass w conditions Fail
2010 0.651955 0.100944 0.247101
2011 0.664268 0.105188 0.230544
2012 0.675195 0.11147 0.213335
2013 0.699294 0.114469 0.186237
2014 0.673113 0.136318 0.190569
In [76]:
import plotly.offline as po
fig3 = go.Figure(layout=go.Layout(
    xaxis_title="Year",
    yaxis_title="Proportion",
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(250,250,250,1)',
    margin=dict(t=20, b=20),
    legend=dict(x=1.01, y=0.5)
))

colors = ['#9ebc00', '#009ab0', '#fa5c18']

fig3.add_scatter(y=year_results["Fail"], name="Fail", mode='lines+markers', stackgroup='one', marker_color='#fa5c18')
fig3.add_scatter(y=year_results["Pass w conditions"], mode='lines+markers', name="Pass w conditions", stackgroup='one', marker_color='#009ab0')
fig3.add_scatter(y=year_results["Pass"], name="Pass", mode='lines+markers', stackgroup='one', marker_color='#9ebc00')
fig3.update_xaxes(ticktext=year_results.index.values, tickvals=[k for k in range(len(year_results.index.values))])
fig3.show()
po.plot(fig3, filename='site/img/inspection_results_time.html')
Out[76]:
'site/img/inspection_results_time.html'

Analysis of the results:

  • The results above show a seemingly significant decrease in the pass/fail ratio starting in 2018. This can be explained by the recent change in the definition of violations by the Chicago Department of Public Health’s Food Protection on the 1st of July 2018. The original description of the dataset mentionned that these changes in the violations could lead to a change in the distribution of the Results (Pass, Pass w/ Conditions or Fail). It seems that there has been an increase in the results with "Pass w/ Conditions" rather than simply "Pass" due to the changes that occurred in 2018.
  • Other than this significant decrease in the pass/fail ratio starting in 2018, the two curves follow the same trend, with 2013 being the most successful year.

1.2.1 Map of the pass/fail ratio per Community Area

We also aim to understand whether some community areas have lower food safety standards. Therefore, we plot the ratios of each community area in 2018.

In [77]:
# For each year, compute the pass/fail ratio
years = [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]
NUM_AREAS = 77

# Initialize DataFrames
ratio_passw_area = pd.DataFrame(0, index=list(range(1, NUM_AREAS+1)), columns=years) # will contain ratio of (pass + pass w/ cond)/fail
ratio_pass_area = pd.DataFrame(0, index=list(range(1, NUM_AREAS+1)), columns=years) # will contain ratio of pass/fail
ax = 0

# Drop null values of the Community Areas - we will not be able to map them
cfi_insp.dropna(subset = ['Community Area'], inplace = True)
cfi_insp['Community Area'] = cfi_insp['Community Area'].astype(int)

# for each year and each area, calculate the ratios
for year in years:
    for area in range(1, NUM_AREAS+1):
        pass_condi = cfi_insp[(cfi_insp['Inspection Year'] == year) & \
                              (cfi_insp['Results'] == 'Pass w/ Conditions') & \
                              (cfi_insp['Community Area'] == area)].shape[0]
        pass_ = cfi_insp[(cfi_insp['Inspection Year'] == year) & \
                         (cfi_insp['Results'] == 'Pass') & (cfi_insp['Community Area'] == area)].shape[0]
        fail_ = cfi_insp[(cfi_insp['Inspection Year'] == year) & \
                         (cfi_insp['Results'] == 'Fail') & (cfi_insp['Community Area'] == area)].shape[0]
        if fail_ != 0: 
            # Can not divide by zero
            ratio_passw_area.loc[area, year] = (pass_+pass_condi)/fail_
            ratio_pass_area.loc[area, year] = pass_/fail_
        else:
            ratio_passw_area.loc[area, year] = (pass_+pass_condi)
            ratio_pass_area.loc[area, year] = pass_
            
ratio_passw_area['code'] = ratio_passw_area.index.astype(str)
ratio_pass_area['code'] = ratio_pass_area.index.astype(str)
In [78]:
ratio_pass_area.head(2)
Out[78]:
2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 code
1 2.793103 2.536585 1.787234 2.363636 3.658537 2.102041 1.931818 4.891892 1.634146 0.646154 1
2 1.576923 1.636364 1.642857 2.281690 6.233333 9.217391 1.919192 1.798077 1.022727 1.181818 2
In [79]:
# define the map
m = crime_distribution(geo_json_data, ratio_passw_area, ['code', 2018])
m.save(outfile= "data/html/area_ratio_passw.html")
#m

The Community Area Pass/Fail ratio map can be found here.

In [80]:
m = crime_distribution(geo_json_data, ratio_pass_area, ['code', 2018])
m.save(outfile= "data/html/area_ratio_pass.html")
#m

The Pass with conditions + Pass / Fail ratio map can be found here.

1.2.2 The evolution of the Violation Score per community area

Extending our analysis from above, we also create a dynamic map that shows the pass/fail ratios in each community area over time.

In [81]:
# For each year, compute the average Inspection Violation Score in each community area
years = [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]
NUM_AREAS = 77

# Initialize DataFrames
score_violation_area = pd.DataFrame(0, index=list(range(1, NUM_AREAS+1)), columns=years) # will contain average violation score

# for each year and each area, calculate the mean
for year in years:
    for area in range(1, NUM_AREAS+1):
        data_area_year = cfi_insp[(cfi_insp['Inspection Year'] == year) & \
                              (cfi_insp['Community Area'] == area)]
        score_violation_area.loc[area, year] = data_area_year['Food Inspection Score'].mean(axis=0)
            
score_violation_area.head()
score_violation_area = score_violation_area.T
In [82]:
color_scale = np.array(['#053061','#2166ac','#4393c3','#92c5de','#d1e5f0','#fddbc7','#f4a582','#d6604d','#b2182b','#67001f'])

def color_coding(df):
    # determine range of values to get equally spaced bins
    max_val = df.max().max()
    min_val = df.min().min()
    width = (max_val - min_val)/7
    bin_edges = np.arange(min_val, max_val+1, width)

    # bin the values 
    idx = np.digitize(df, bin_edges, right=True)
    return color_scale[idx-1], bin_edges
In [83]:
colors, edges = color_coding(score_violation_area)
In [84]:
score_violation_area.index.name = 'Year'
score_violation_area.shape
Out[84]:
(10, 77)
In [85]:
map_ = create_the_map_crimes('P1Y', edges, color_scale, geo_json_data, score_violation_area, colors)
map_.save('data/html/Food_inspection_evolution_.html')
#map_

See the evolution of the violation score in each community area here

1.3 The Inspection Type over time

The inspection type can give valuable information as to whether the inspection was standard procedure, or due to a complaint. Thus, we will analyse the evolution of the inspection types over time.

In [86]:
# Making a list og the Inspection Types showing some sort of complaint
canvass = ['Canvass', 'Canvass Re-Inspection']
complaint = ['Complaint', 'Complaint Re-Inspection', 'Short Form Complaint', \
            'Complaint-Fire', 'Short Form Fire-Complaint', 'Complaint-Fire Re-inspection', \
            'Suspected Food Poisoning', 'Suspected Food Poisoning Re-inspection']
license = ['License', 'License Re-Inspection' 'License-Task Force', 'Pre-License Consultation']

inspections_types = [canvass, complaint, license]
inspections_names = ['Canvass', 'Complaint', 'License & Consultation', 'Other']
years_insp = pd.DataFrame(index=years, columns=inspections_names)

for i, insp in enumerate(inspections_names):
    for year in years:
        if(insp == 'Other'):
            years_insp[insp][year] = cfi_insp[(cfi_insp['Inspection Year'] == year)].shape[0] - years_insp.loc[year].sum()
        else:
            years_insp[insp][year] = cfi_insp[(cfi_insp['Inspection Year'] == year) & \
                                              (cfi_insp['Inspection Type'].isin(inspections_types[i]))].shape[0]

years_insp = years_insp.div(years_insp.sum(axis=1), axis=0)    
years_insp.head(2)
Out[86]:
Canvass Complaint License & Consultation Other
2010 0.475236 0.26427 0.109393 0.151101
2011 0.524076 0.248134 0.0990583 0.128731
In [87]:
colorsM = ['rgba(0,173,173,1)', 'rgba(231,57,129,1)', 'rgba(204,238,0,1)', 'rgba(255,138,0,1)']
colorsF = ['rgba(0,173,173,0.5)', 'rgba(231,57,129,0.5)', 'rgba(204,238,0,0.5)', 'rgba(255,138,0,0.5)']

fig4 = go.Figure(layout=go.Layout(
    xaxis_title="Year",
    yaxis_title="Proportion",
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(250,250,250,1)',
    margin=dict(t=20, b=20),
    legend=dict(x=1.01, y=0.5)
))
for i, col in enumerate(years_insp.columns):
    fig4.add_scatter(y=years_insp[col], name=col, mode='lines+markers', stackgroup='one', marker_color=colorsM[i], fillcolor=colorsF[i])
fig4.update_xaxes(ticktext=years_insp.index.values, tickvals=[k for k in range(len(years_insp.index.values))])
fig4.show()
po.plot(fig4, filename='site/img/inspection_type_time.html')
Out[87]:
'site/img/inspection_type_time.html'

1.4 Dynamic map of the evolution of food inspection results over time

To have a nice visualization of the evolution of food safety in Chicago, we created a dynamic maps using Folium. Note: for this task, it is better to use the inspection centric dataset ("cfi_insp").

In [88]:
with pd.option_context('display.max_columns', None): display(cfi_insp.head(1))
Inspection ID DBA Name License # Facility Type Risk Address City State Zip Inspection Date Inspection Type Results Violations Latitude Longitude Community Area Chain Score Violation Food Inspection Score Smoothed Food Inspection Score Inspection Year
0 2320248.0 THE REDHEAD PIANO BAR 2313942.0 Restaurant 3.0 16-18 W ONTARIO ST CHICAGO IL 60654.0 2019-10-22 License Pass w/ Conditions NaN 41.893371 -87.628783 8 None 0.0 10.0 10.0 2019.0
In [89]:
cfi_insp['color'] = cfi_insp['Results'].apply(results_to_color)
In [90]:
#Second, we plotted a map on which the restaurants stayed forever (duration = None)
#For this task, as the point disappear, we can take the full dataset
map_2 = create_the_map(cfi_insp, 68000, 'P1M', geo_json_chicago_area)
map_2.save('data/html/map_evolution_inspection_full.html')
#map_2

The map can be seen here.

2. Analysis of food chains

Are large food chains more prone to fail food inspections?

In [91]:
# We take only the data for wich the chain is known and remove the one having <10 restaurants
cfi_recent_chain = cfi_recent[~cfi_recent['Chain'].isin(['Wendy\'s','Baskin-Robbins','Hardee\'s','None'])]
cfi_insp_chain = cfi_insp[~cfi_insp['Chain'].isin(['Wendy\'s','Baskin-Robbins','Hardee\'s','None'])]

chain_list = list(cfi_recent_chain['Chain'].unique())
In [92]:
# Creating series with the inspection results for each chain
chain_pass = cfi_insp_chain[cfi_insp_chain['Results'] == 'Pass']['Chain'].value_counts().to_frame(name = 'Pass')
chain_passWC = cfi_insp_chain[cfi_insp_chain['Results'] == 'Pass w/ Conditions']['Chain'].value_counts().to_frame(name = 'Pass w/ Conditions')
chain_fail = cfi_insp_chain[cfi_insp_chain['Results'] == 'Fail']['Chain'].value_counts().to_frame(name = 'Fail')
chain = chain_pass.join(chain_passWC,how='outer').join(chain_fail,how='outer')
chain = chain.fillna(0)

# Normalize by total number of inspection results
chain['Sum'] = chain['Pass'] + chain['Fail'] + chain['Pass w/ Conditions']
chain['Pass'] = chain['Pass']/chain['Sum']
chain['Fail'] = chain['Fail']/chain['Sum']
chain['Pass w/ Conditions'] = chain['Pass w/ Conditions']/chain['Sum']
chain.drop(columns=['Sum'],inplace=True)
In [93]:
fig6 = go.Figure(layout=go.Layout(
    xaxis_title="Chain",
    yaxis_title="Proportion",
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
    margin=dict(t=20, b=20),
    legend=dict(x=1.01, y=0.5)
))

colors = ['#9ebc00', '#009ab0', '#fa5c18']

for i, col in enumerate(chain.columns.values):
    fig6.add_bar(y=chain[col], name=col, marker_color=colors[i])
fig6.update_xaxes(ticktext=chain.index.values, tickvals=[k for k in range(len(chain.index.values))])
fig6.show()
po.plot(fig6, filename='site/img/chain_results.html')
Out[93]:
'site/img/chain_results.html'

We can see that starbucks has the lowest fail rate and the highest pass rate. All chains are not exceeding 25% of failure. Domino's has the lowest pass rate but also the highest pass with condition rate.

Here one can observe that most chain have one risk category associated to most of its restaurants. Only subway, kfc and taco bell are classified as risk 1. There are no obvious pattern considering this risk categorisation.

In [94]:
years = [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]

# Initialize matrix to collect failure rates
ratio_fail = np.zeros([len(years),len(chain_list)])

for i, chain in enumerate(chain_list):
    activity_years = 0 # Counter of the number of years of activity (with at least one inspection)
    for j, year in enumerate(years):
        pass_condi = cfi_insp_chain[(cfi_insp_chain['Inspection Year'] == year)\
                                    & (cfi_insp_chain['Results'] == 'Pass w/ Conditions')\
                                    & (cfi_insp_chain['Chain'] == chain)].shape[0]
        pass_ = cfi_insp_chain[(cfi_insp_chain['Inspection Year'] == year)\
                                    & (cfi_insp_chain['Results'] == 'Pass')\
                                    & (cfi_insp_chain['Chain'] == chain)].shape[0]
        fail_ = cfi_insp_chain[(cfi_insp_chain['Inspection Year'] == year)\
                                    & (cfi_insp_chain['Results'] == 'Fail')\
                                    & (cfi_insp_chain['Chain'] == chain)].shape[0]
        
        # Number of inspections in this year
        sum_ = pass_+pass_condi+fail_
        
        if(j != 0): # If there was an inspection this year
            if(sum_ != 0):
                activity_years = activity_years + 1
                memory = memory + fail_/(sum_)
                ratio_fail[j,i] = memory/activity_years
            else: 
                memory = memory
                ratio_fail[j,i] = memory
        else: # If there wasn't any inspection this year
            if(sum_ != 0):
                activity_years = activity_years + 1
                memory = fail_/(sum_)
                ratio_fail[j,i] = memory/activity_years
            else: 
                memory = 0
                ratio_fail[j,i] = memory

ratio_years = pd.DataFrame(ratio_fail, index = years, columns = chain_list)
In [95]:
fig7 = go.Figure(layout=go.Layout(
    xaxis_title="Year",
    yaxis_title="Failure rate",
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(200,200,200,0.1)',
    margin=dict(t=20, b=20),
    legend=dict(x=1.01, y=0.5)
))
for col in ratio_years.columns.values:
    fig7.add_scatter(y=ratio_years[col], name=col)
fig7.update_xaxes(ticktext=ratio_years.index.values, tickvals=[k for k in range(len(ratio_years.index.values))])
fig7.show()
po.plot(fig7, filename='site/img/chain_evol.html')
Out[95]:
'site/img/chain_evol.html'

As we can see on the graph above, the failure rate seems to have been quite stable in the last couple of years. Papa John's Pizza has the highest failure rate, and Starbucks the lowest. The establishments that had "zero" failure rate (Domino's and Papa John's Pizza) and all the sudden had a higher failure rate are likely establishments which were not existant before the year where the failure rate switched to being non-zero.

3. Correlation analysis of hygienic standards of the food places and customer reviews

One of the questions that we wanted to address with the Yelp data was to see whether they were some correlation between the food inspection results and teh Yelp ratings. As stated before, we don't have the full set of ratings yet du to the Yelp API restriction. However, we still try to answer this question with the data that we presently have, knowing the fact that the conclusion that we draw now might change after with the full dataset.

3.1 Overview of the Yelp ratings and number of reviews

First, let's have a look at the distribution of the rating in our dataset.

In [96]:
cfi_recent['Yelp rating'].hist(bins = 50)
plt.title('Histogram of the yelp rating')
plt.xlabel('Yelp rating')
plt.ylabel('Number of occurences')
plt.show()

As stated before, this distribution is a discrete one, as the Yelp rating only take values between 0 and 5, with 0.5 increments. Moreover, note that this distribution is bounded between 0 (minimum grade) and 5 (maximum grade).

Now, let's plot the distribution of the number of reviews per restaurants.

In [97]:
cfi_recent['Yelp review count'].hist(bins = 50)
plt.title('Histogram of the number of reviews')
plt.xlabel('Yelp number of reviews')
plt.ylabel('Number of occurences')
plt.show()

Note that once again, this distribution is discrete as the number of reviews can only be integer. However, from the previous plot, it seems that the distribution obtained follow a powerlaw, with most of the bussiness having only a small number of reviews, but still quite some outliers. This can be explained by the fact that some restaurants are very famous in Chicago, and thus obtained lot's of reviews on Yelp. One interesting thing to see is wheter the most famous places are also the "best" places. To do this, we can simply plot a scatter plot of the number vs the ratings.

In [98]:
plt.scatter(cfi_recent['Yelp rating'], cfi_recent['Yelp review count'])
plt.grid()
plt.xlabel('Rating')
plt.ylabel('Number of reviews')
plt.show()

For now, and with the data that we have, there is no clear correlation between the number of reviews and the rating. Of course, the extremes ratings (0 and 5), always have very few number of reviews. It makes sense to assume the fact that after a certaing number of reviews, the restaurant cannot have a 5 star or a 1 star.

Now, let's just put a dual map of the Yelp rating. Plot on a map the restaurants with their rating on Yelp. On this plot, the color of each points represent the rating of the restaurants. Note that due to avoid to have a very crowded map, we only plotted some points and not the all dataset. Plotting two maps enables us to see the difference between two groups of restaurants. Namely here we choose to highlhight the difference between the restaurants with the highest number of reviews on Yelp and the one with the smallest number of reviews.

In [99]:
N_top = 100 #Display the 100 first restaurants in term of number of review
data = cfi_recent[['Latitude', 'Longitude', 'Yelp name', 'Yelp review count', 'Yelp rating']].dropna().sort_values(by='Yelp review count', ascending=False) #.sample(50)
data['Yelp color']=data['Yelp rating'].apply(yelp_to_color)

chicago_dual_map_ = chicago_dual_map(data, 100, geo_json_chicago_area)
chicago_dual_map_.save('data/html/dual.html')
#chicago_dual_map_

The dual map can be seen here

3.2 Correlation between the food inspection fail rate of areas and Yelp rating

Now, let's go back to our initial question and try to analyse whether we can find some correlation between the inspections and the Yelp rating.

First, let's have a look at the correlation between the CFI score ('Smoothed Food Inspection Score') and the Yelp rating.

In [100]:
data = cfi_recent[['Smoothed Food Inspection Score', 'Yelp rating', 'Yelp review count']].dropna()
rho, p = ss.pearsonr(data['Yelp rating'], data['Smoothed Food Inspection Score'])
print('The correlation coefficient between the CFI score and the Yelp rating is '+str(np.around(rho, 3))+', with a p-value of',np.around(p, 5))
The correlation coefficient between the CFI score and the Yelp rating is -0.036, with a p-value of 0.0003

As we can see on the above cell and quite surprisingly, the CFI score and the Yelp rating are negatively correlated. Although this coefficient seems small at a first glance, it is statistically different from 0 as we can see with the above p-value.

In [101]:
data = cfi_recent[['Smoothed Food Inspection Score', 'Yelp rating', 'Yelp review count', 'Chain']].copy(deep=True)
data[data['Chain']=='None']=np.NaN
data.dropna(inplace=True)
rho, p = ss.pearsonr(data['Yelp rating'], data['Smoothed Food Inspection Score'])
print('The correlation coefficient between the CFI score and the Yelp rating for the chains is '+str(np.around(rho, 5))+', with a p-value of',np.around(p, 3))
The correlation coefficient between the CFI score and the Yelp rating for the chains is -3e-05, with a p-value of 1.0

Although the correlation is statistically different form 0 for the whole dataset, this coefficient is not statistically different from 0 for the chains restaurants.

In [102]:
data = cfi_recent[['Smoothed Food Inspection Score', 'Yelp rating', 'Yelp review count']].dropna().sort_values('Yelp review count', ascending=False).iloc[0:500]
rho, p = ss.pearsonr(data['Yelp rating'], data['Smoothed Food Inspection Score'])
print('The correlation coefficient between the CFI score and the Yelp rating for the most famous restaurants is '+str(np.around(rho, 3))+', with a p-value of',np.around(p, 3))
The correlation coefficient between the CFI score and the Yelp rating for the most famous restaurants is -0.057, with a p-value of 0.204

Note that this same coefficient is also not significant for the top restaurants in term of the number of review on Yelp, i.e. for the most famous restaurants. But is this really due to the fact that these restaurants are "special" or is it only due to the fact that we are taking a smaller subset, and this would be a nice illustration of Symspon Paradox in this case! Let's check this!

In [103]:
data = cfi_recent[['Smoothed Food Inspection Score', 'Yelp rating', 'Yelp review count']].dropna().sort_values('Yelp review count', ascending=False)
rho_s = []
p_s = []
n_cat = 500
for i in range(1,int(data.shape[0]/n_cat)):
    temp = data.iloc[n_cat*(i-1):n_cat*i]
    rho, p = ss.pearsonr(temp['Yelp rating'], temp['Smoothed Food Inspection Score'])
    rho_s.append(rho)
    p_s.append(p)

p_s = np.array(p_s)
print('The mean of the p-values of the correlation coefficients between the CFi score and the Yelp rating score, where we took a rolling window of 500, and ordered the dataset by the number of review on Yelp, is given by',np.mean(np.around(np.mean(p_s),2)))
plt.plot(rho_s, label='Correlation coefficient')
print('And among the '+str(i)+' groups, there are only '+str(p_s[p_s<=0.05].shape[0])+' groups which have a p-values <= 0.05!')
plt.plot(p_s, label='p-values')
plt.legend()
plt.show()
The mean of the p-values of the correlation coefficients between the CFi score and the Yelp rating score, where we took a rolling window of 500, and ordered the dataset by the number of review on Yelp, is given by 0.39
And among the 19 groups, there are only 0 groups which have a p-values <= 0.05!

Thus, as we can see on the cell above, most of the coefficient are not statistically different from 0, although the coefficient using the all sample is statistically different from 0! That's a nice illustration of Sympson paradox.

4. Patterns of neighborhood crime rates

4.1 Distribution of crimes among community areas

Chicago has 77 different community areas and our crimes dataset contains information about the community area where the incident occurred. Therefore, let's see whether there are some distinct patterns among these areas.

In [104]:
# get the total number of incidents per community area
cts = crimes['Community Area'].value_counts()
areas = pd.DataFrame(cts.values, cts.index, columns=["crimes"])
areas['code'] = areas.index
# convert the code into the string to allow for matching with geojson data later on
areas['code'] = areas['code'].apply(str)
areas.head(2)
Out[104]:
crimes code
25 182705 25
8 107130 8

We can now have a look at the data in a bar plot and see that there are huge differences among the different areas.

In [105]:
plt.bar(cts.index, cts.values)
plt.xlabel('Community Areas (from 1 to 77)')
plt.ylabel('Nbr of crimes from 2010 to present')
plt.title('Community Area Comparison')
plt.show()

The bar plot above shows that the community area 25 seems to have the highest number of crimes since 2010.

4.2 Geographical distribution of crimes

Let's turn to folium and have a look at the different community area boundaries using Chicago's Data Portal geojson data in the map visible here with the total number of crimes per community area since 2010.

In [106]:
# define the map
m = crime_distribution(geo_json_data, areas, ['code', 'crimes'])
m.save(outfile= "data/html/area_counts.html")
#m

In the map visible here, we can now see that the different areas have a very different number of crimes as suggested by the barplot.

4.2.1 Distribution of crime rate per population size

However, the incidence rate per population size would be even more informative to know which areas are problematic. To do so, we use the census data by community area from 2017 found here as a proxy for the population from 2010 to the present in each area.

In [107]:
census = pd.read_csv('data/census.csv', sep=';')
census.head(2)
Out[107]:
code count
0 1 53470
1 2 75185
In [108]:
# convert the area code to int for the merge 
areas['code'] = areas['code'].astype(int)
areas = pd.merge(areas, census, how='inner', on='code')
# calculate the proportional incidence rate 
areas['prop'] = areas['crimes']/areas['count']
# reconvert the area code to string for binding with geojson data
areas['code'] = areas['code'].apply(str)
areas.head(2)
Out[108]:
crimes code count prop
0 182705 25 97604 1.871901
1 107130 8 96466 1.110547
In [109]:
# define the map
m = crime_distribution(geo_json_data, areas, ['code', 'prop'])
m.save(outfile= "data/html/area_crimerate.html")
#m

In the map visible here, we can observe that the results have actually changed quite drastically compared to the absolute number of crimes. There are now more areas with a relatively elevated incidence score.

4.2.2 Distribution of crime rate per surface area

As it is difficult to get census data on all years we consider, we believe that it is more informative in the end to look at the crime density per surface area of each community area to understand the likelihood of a crime happening at a certain location. This also has the convenience that the surface area remains constant over the considered time period.

In [110]:
# add a column on surface area of the each community area
from area import area
areas['surface'] = np.nan
for feature in geo_json_data['features']:
    idx = int(feature['properties']['area_num_1'])-1
    areas.loc[idx,'surface'] = area(feature['geometry'])

# calculate the number of crimes per surface area
areas['crimes density'] = areas['crimes']/areas['surface']    
In [111]:
m = crime_distribution(geo_json_data, areas, ['code', 'crimes density'])
m.save(outfile= "data/html/area_crimedensity.html")
#m

In the map visible here, we can now see that the crime density based on the community area's surface is quite similar to the initial plot of absolute number of crimes independent of the other area's attributes.

4.2.3 Distribution of crime rate per surface area in 2019

Now, let's just look at the crimes in 2019.

In [112]:
recent_crimes = crimes[crimes['Year'] == 2019]
cts = recent_crimes['Community Area'].value_counts()
n_areas = pd.DataFrame(cts.values, cts.index, columns=["crimes"])
n_areas['code'] = n_areas.index

# add a column on surface area of the each community area
n_areas['surface'] = np.nan
for feature in geo_json_data['features']:
    idx = int(feature['properties']['area_num_1'])
    n_areas.loc[idx,'surface'] = area(feature['geometry'])/1000000

# calculate the number of crimes per surface area
n_areas['crimes density'] = n_areas['crimes']/n_areas['surface']

# reconvert the area code to string for binding with geojson data
n_areas['code'] = n_areas['code'].apply(str)
In [113]:
# define the map
crimes_map = interactive_crime(n_areas, geo_json_data)
crimes_map.save(outfile= "data/html/area_crimedensity2019.html")
#crimes_map

In the map visible here, we can see that the crime density based on the community area's surface is quite similar to the initial plot of absolute number of crimes independent of the other area's attributes.

4.3 Comparison with violent crimes

Just looking at the mere total number of crimes does not inform us where the most violent crimes happen and which areas are really dangerous to go to. Therefore, we separately look at the four worst types of crimes: criminal sexual assault, battery, kidnapping and homicide.

In [114]:
crimes['Primary Type'].unique()
Out[114]:
array(['SEX OFFENSE', 'OFFENSE INVOLVING CHILDREN', 'BATTERY', 'ASSAULT',
       'THEFT', 'STALKING', 'CRIMINAL DAMAGE', 'MOTOR VEHICLE THEFT',
       'OTHER OFFENSE', 'NARCOTICS', 'CRIM SEXUAL ASSAULT', 'ROBBERY',
       'INTERFERENCE WITH PUBLIC OFFICER', 'CRIMINAL TRESPASS',
       'DECEPTIVE PRACTICE', 'BURGLARY', 'WEAPONS VIOLATION',
       'PROSTITUTION', 'GAMBLING', 'ARSON', 'PUBLIC PEACE VIOLATION',
       'KIDNAPPING', 'INTIMIDATION', 'CONCEALED CARRY LICENSE VIOLATION',
       'LIQUOR LAW VIOLATION', 'PUBLIC INDECENCY',
       'OTHER NARCOTIC VIOLATION', 'OBSCENITY', 'HUMAN TRAFFICKING',
       'HOMICIDE', 'NON-CRIMINAL', 'NON - CRIMINAL',
       'NON-CRIMINAL (SUBJECT SPECIFIED)'], dtype=object)
In [115]:
violent_crimes = ['CRIM SEXUAL ASSAULT','BATTERY','KIDNAPPING', 'HOMICIDE']
In [116]:
violent = crimes[crimes['Primary Type'].apply(lambda x: x in violent_crimes)]
violent_2019 = violent[violent['Year'] == 2019]
cts = violent_2019['Community Area'].value_counts()
n_areas = pd.DataFrame(cts.values, cts.index, columns=["crimes"])
n_areas['code'] = n_areas.index

# add a column on surface area of the each community area (in squarekilometers)
from area import area
n_areas['surface'] = np.nan
for feature in geo_json_data['features']:
    idx = int(feature['properties']['area_num_1'])
    n_areas.loc[idx,'surface'] = area(feature['geometry'])/1000000

# calculate the number of crimes per surface area
n_areas['crimes density'] = n_areas['crimes']/n_areas['surface']

# replace nan values
n_areas.fillna(0, inplace=True)
n_areas.loc[9,'code']=9

# reconvert the area code to string for binding with geojson data
n_areas['code'] = n_areas['code'].apply(int).apply(str)
In [117]:
n_areas.head(2)
Out[117]:
crimes code surface crimes density
25 2930 25 18.524966 158.164936
43 1747 43 7.606158 229.682320
In [118]:
# define the map
violent_crimes_map = interactive_crime(n_areas, geo_json_data)
violent_crimes_map.save(outfile= "data/html/area_crimedensity2019_violent.html")
#violent_crimes_map

The map is visible here

4.4 Evolution of crimes over the years

We have a look at the evolution of the crimes over the period from 2010 to 2018. The current year is excluded as it would give a false impression of a decline due to roughly 2 months of missing data. Just normalizing by the number of months we have data on and then project the crimes for the whole year does not work either in a satisfying manner as crime rates underlie strong seasonal fluctuations.

In [119]:
# get the number of crimes per community area and year until 2018
past_crimes = crimes[crimes['Year'] != 2019]
crimes_grouped = past_crimes.groupby(['Community Area', 'Year']).ID.count()
crimes_grouped = pd.DataFrame(crimes_grouped.values, crimes_grouped.index, columns=["Crimes"])
# create a column for each community area
crimes_grouped = crimes_grouped.unstack(level=0)
crimes_grouped.head(2)
Out[119]:
Crimes
Community Area 1 2 3 4 5 6 7 8 9 10 ... 68 69 70 71 72 73 74 75 76 77
Year
2010 5570 4877 5070 2559 2059 7458 5795 11584 303 1439 ... 10161 9365 3473 10654 1285 4544 916 3157 1875 3667
2011 5113 4416 4772 2315 1943 7522 5281 11213 364 1437 ... 9315 8917 3132 10248 1235 4152 727 2906 1923 3336

2 rows × 77 columns

In [120]:
crimes_grouped.plot(legend=None)
plt.title('Number of crimes per community area over time', Fontsize = 20)
plt.ylabel('Number of crimes', Fontsize = 16)
plt.xlabel('Year', Fontsize = 16)
plt.show()
In [121]:
# take away the multi-indexing to facilitate manipulations on the data
density_grouped = crimes_grouped.Crimes.rename_axis([None], axis=1).reset_index()
density_grouped.set_index('Year', inplace=True)

# normalize the number of crimes by the surface area: nbr of crimes per squarekilometer
for area in density_grouped.columns:
    # area in squarekilometers
    surf = areas['surface'][area-1]/1000000
    density_grouped[area] = density_grouped[area]/surf
In [122]:
density_grouped.plot(legend=None)
plt.title('Number of crimes per community area over time', Fontsize = 20)
plt.ylabel('Number of crimes', Fontsize = 16)
plt.xlabel('Year', Fontsize = 16)
plt.show()

4.5 Create a time-evolution map

For better visual analysis, we create a map with the crime density in each community area evolving over time.

4.5.1 Define the color coding

In [123]:
color_scale = np.array(['#053061','#2166ac','#4393c3','#92c5de','#d1e5f0','#fddbc7','#f4a582','#d6604d','#b2182b','#67001f'])

def color_coding(df):
    # determine range of values to get equally spaced bins
    max_val = df.max().max()
    min_val = df.min().min()
    width = (max_val - min_val)/10
    bin_edges = np.arange(min_val, max_val+1, width)
    print(min_val, max_val)

    # bin the values 
    idx = np.digitize(df, bin_edges, right=True)
    return color_scale[idx-1], bin_edges
In [124]:
colors, edges = color_coding(density_grouped)
37.065113966341414 2475.9885236392984

4.5.2 Define the map

In [125]:
map_1 = create_the_map_crimes('P1Y', edges, color_scale, geo_json_data, density_grouped, colors)
map_1.save('data/html/crimedensity_evolution.html')
#map_1

In the map visible here, we can observe that the crime density for a given community area actually changes in some areas quite a bit over time

4.6 Correlation analysis of hygienic standards and public safety

Are there correlates between the food inspection fail rate of areas and public safety, measured through crime rates?

In [126]:
smoothed_score = cfi_recent[['Smoothed Food Inspection Score', 'Community Area']].groupby('Community Area').mean()
smoothed_score['code'] = smoothed_score.index
n_areas['code'] = n_areas['code'].astype('int64')
data = smoothed_score.merge(n_areas)
In [127]:
rho, p = ss.pearsonr(data['crimes density'], data['Smoothed Food Inspection Score'])
print('The correlation coefficient between the CFI score and the crime density.htmlty is '+str(np.around(rho, 3))+', with a p-value of',np.around(p, 10))
The correlation coefficient between the CFI score and the crime density.htmlty is 0.257, with a p-value of 0.0259712946

5 Concluding analysis

  • Where can one safely eat delicious food in Chicago?

  • Do certain neighborhoods offer a better overall dining experience based on customer reviews, and food as well as neighborhood safety?

In [128]:
#For the crime zones
recent_crimes = crimes[crimes['Year'] == 2019]
cts = recent_crimes['Community Area'].value_counts()
n_areas = pd.DataFrame(cts.values, cts.index, columns=["crimes"])
n_areas['code'] = n_areas.index

# add a column on surface area of the each community area
n_areas['surface'] = np.nan
from area import area
for feature in geo_json_data['features']:
    idx = int(feature['properties']['area_num_1'])
    n_areas.loc[idx,'surface'] = area(feature['geometry'])

# calculate the number of crimes per surface area
n_areas['crimes density'] = n_areas['crimes']/n_areas['surface']

# reconvert the area code to string for binding with geojson data
n_areas['code'] = n_areas['code'].apply(str)
In [129]:
#cfi_recent['Yelp category'] = cfi_recent['Yelp category'].apply(lambda x: eval(x))
cfi_recent.drop(cfi_recent[cfi_recent['Yelp category']=='[]'].index, inplace=True)
data = cfi_recent[['DBA Name', 'Longitude', 'Latitude', 'Yelp rating', 'Yelp category', 'Violations', 'Facility Type', 'Smoothed Food Inspection Score']].dropna() 
list_category = pd.unique(data.explode(column='Yelp category')['Yelp category'].dropna())

q_25 = data['Smoothed Food Inspection Score'].quantile(0.25)
q_50 = data['Smoothed Food Inspection Score'].quantile(0.5)
q_75 = data['Smoothed Food Inspection Score'].quantile(0.75)

data['CFI color']=data['Smoothed Food Inspection Score'].apply(cfi_score_to_color, q_25=q_25, q_50=q_50, q_75=q_75)
for i in range(len(geo_json_data['features'])):
    geo_json_data['features'][i]['properties']['community']=geo_json_data['features'][i]['properties']['community'].lower().capitalize()

m_tab = final_plot(data, geo_json_data, list_category, n_areas)
for j in range(len(m_tab)):
    m_tab[j].save('data/html/final_'+str(j+1)+'.html')

Congratulations for bearing with us until the end! ;)

You can discover our final plot in our data story.

In [ ]: